Python Collections (Sets & Sequences)¶

Python also has built-in types that are compound, meaning that they group together multiple objects.
- These are called collections or containers.
- The objects in a collection are generally referred to as items or elements.
Collections can hold any number of items, and a single collection can contain items of different types.

There are three categories of built-in collection types in Python:

Sets
Sequences
Mappings

Streams are another kind of collection, one whose elements form a series in time rather than in the computer’s memory space.
Modules such as array and collections provide additional collection types.

The distinguishing feature of the collection categories is how individual elements are accessed:
- Sets don't allow individual access.
- Sequences use numeric indexes.
- Mappings use keys.
- Streams provide just one address: "next".
Some types are immutable: once created they cannot be changed, i.e., elements cannot be added, removed, or reordered.
Primitive types are immutable.
Mutable built-in collections cannot be added to sets or used as mapping keys.
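
The restriction on set members can be checked directly. The following is a minimal sketch; the variable names are illustrative and not taken from the text above.

```python
# Immutable objects such as tuples, frozensets, and strings are hashable,
# so they can be members of a set (or keys of a mapping).
members = {('T', 'C'), frozenset('AG'), 'TCAG'}

# A list is mutable and therefore unhashable; adding it raises TypeError.
try:
    members.add(['T', 'C'])
    list_was_added = True
except TypeError:
    list_was_added = False

print(list_was_added)   # False
print(len(members))     # 3
```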

Certain basic operations and functions are applicable to all collection types.
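
As an illustration, the sketch below applies len, the in operator, min, and max to a set and to three sequence types. This particular selection of operations is an assumption chosen for the example, not an exhaustive list.

```python
# The same four bases stored in four different collection types.
containers = [set('TCAG'), 'TCAG', ('T', 'C', 'A', 'G'), ['T', 'C', 'A', 'G']]

summary = []
for coll in containers:
    # len, membership testing, min, and max work the same way on each type.
    summary.append((type(coll).__name__, len(coll), 'A' in coll, min(coll), max(coll)))

for row in summary:
    print(row)
# ('set', 4, True, 'A', 'T')
# ('str', 4, True, 'A', 'T')
# ('tuple', 4, True, 'A', 'T')
# ('list', 4, True, 'A', 'T')
```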

Sets¶

A set is a mutable, unordered collection of items that contains no duplicates.
The type frozenset is an immutable version of set.
Sets can be created by calling set with a string (or any other collection) as its argument; each distinct element appears once in the result.
Sets can also be created syntactically by enclosing comma-separated values in curly braces ({}). This construction is called a set display.

set('TCAGTTAT')

{'A', 'C', 'G', 'T'}

DNABases = {'T', 'C', 'A', 'G'}

RNABases = {'U', 'C', 'A', 'G'}

DNABases

{'A', 'C', 'G', 'T'}

RNABases

{'A', 'C', 'G', 'U'}

{'TCAG'}

{'TCAG'}

{'TCAG', 'UCAG'}

{'TCAG', 'UCAG'}

{'AATTGC'}

{'AATTGC'}

set('AATTGC')

{'A', 'C', 'G', 'T'}

# Rewrite validate_base_sequence using sets
DNAbases = set('TCAGtcag')
RNAbases = set('UCAGucag')
def validate_base_sequence(base_sequence, RNAflag = False):
    """Return True if the string base_sequence contains only upper- or lowercase
    T (or U, if RNAflag), C, A, and G characters, otherwise False"""
    return set(base_sequence) <= (RNAbases if RNAflag else DNAbases)

validate_base_sequence('tattattat',True)

False

set('tattattat')

{'a', 't'}

RNAbases

{'A', 'C', 'G', 'U', 'a', 'c', 'g', 'u'}

validate_base_sequence('atgcwrqatgc')

False
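
Besides the subset test (<=) used in validate_base_sequence, sets support the usual algebraic operators. The sketch below uses illustrative names (dna, rna) and sorts results only to make the printed order predictable.

```python
dna = set('TCAG')
rna = set('UCAG')

shared = dna & rna        # intersection: bases found in both
either = dna | rna        # union: bases found in at least one
dna_only = dna - rna      # difference: bases found only in dna
differing = dna ^ rna     # symmetric difference: bases found in exactly one

print(sorted(shared))     # ['A', 'C', 'G']
print(sorted(either))     # ['A', 'C', 'G', 'T', 'U']
print(dna_only)           # {'T'}
print(sorted(differing))  # ['T', 'U']

# An empty pair of braces creates a dictionary, not a set;
# an empty set must be written set().
empty = set()
print(type(empty).__name__, type({}).__name__)   # set dict

# A frozenset supports the same queries but has no mutating methods.
frozen_dna = frozenset(dna)
print(frozen_dna <= either)                      # True
```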

Sequences¶

Sequences are ordered collections that may contain duplicate elements. Some sequence types are mutable and others are immutable.
Their elements can be referenced by position.

There are six built-in sequence types: strings, bytes, bytearrays, ranges, tuples, and lists.

All six sequence types support the methods count and index (ranges have supported them since Python 3.2).
In addition, all sequence types support the reversed function, which returns a special object that produces the elements of the sequence in reverse order.
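
A short sketch of count, index, and reversed on a string, a tuple, and a list; the sample values are illustrative.

```python
seq = 'TCAGTTAT'
print(seq.count('T'))             # 4
print(seq.index('G'))             # 3

numbers = (5, 6, 7, 6)
print(numbers.count(6))           # 2
print(numbers.index(6))           # 1  (position of the first match)

# reversed returns an iterator, not a new sequence, so it is usually
# consumed by a constructor, a loop, or join.
reversed_seq = ''.join(reversed(seq))
reversed_list = list(reversed([1, 2, 3]))
print(reversed_seq)               # TATTGACT
print(reversed_list)              # [3, 2, 1]
```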

Strings, Bytes & Bytearrays¶

Strings are sequences of Unicode characters (though there is no "character" type).
- Unicode is an international standard that assigns a unique number to every character of all major languages (including special "languages" such as musical notation and mathematics), as well as special characters such as diacritics and technical symbols.
The bytes and bytearray types are sequences of single bytes.
bytes and bytearrays support much the same operations as strings, with one important difference:
- Indexing or slicing a string always returns a string. Slicing a bytes or bytearray object returns an object of the same type, but indexing a single element returns an integer from 0 through 255.
- The bytes and bytearray types serve many purposes, one of which is to efficiently store and manipulate raw data.
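
The indexing difference is easy to see side by side. This is a small sketch with illustrative variable names.

```python
text = 'TCAG'
data = b'TCAG'                # bytes: immutable
buffer = bytearray(data)      # bytearray: mutable

print(text[0], text[0:2])     # T TC
print(data[0], data[0:2])     # 84 b'TC'
print(buffer[0:2])            # bytearray(b'TC')

# Because a bytearray is mutable, an element can be replaced in place;
# the new value must be an integer from 0 through 255.
buffer[0] = ord('U')
print(buffer)                 # bytearray(b'UCAG')
```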

Creating¶

str() - Returns an empty string
str(obj) - Returns a printable representation of obj, as specified by the definition of the type of obj
chr(n) - Returns the one-character string corresponding to the integer n in the Unicode system
ord(char) - Returns the Unicode number corresponding to the one-character string char

str()

''

str(4)

'4'

chr(105)

'i'

ord('D')

68

Testing¶

str1.isalpha() - Returns True if str1 is not empty and all of its characters are alphabetic
str1.isdigit() - Returns True if str1 is not empty and all of its characters are digits
str1.islower() - Returns True if str1 contains at least one "cased" character and all of its cased characters are lowercase
str1.isupper() - Returns True if str1 contains at least one "cased" character and all of its cased characters are uppercase
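
A few sample calls, including the edge cases for empty strings and strings without cased characters. The sample strings are illustrative.

```python
results = {
    "'TCAG'.isalpha()": 'TCAG'.isalpha(),    # True
    "'TC AG'.isalpha()": 'TC AG'.isalpha(),  # False: a space is not alphabetic
    "''.isalpha()": ''.isalpha(),            # False: the string is empty
    "'2024'.isdigit()": '2024'.isdigit(),    # True
    "'tcag'.islower()": 'tcag'.islower(),    # True
    "'TCAG'.isupper()": 'TCAG'.isupper(),    # True
    "'123'.isupper()": '123'.isupper(),      # False: no cased characters
    "'TC-1'.isupper()": 'TC-1'.isupper(),    # True: uncased characters are ignored
}
for call, value in results.items():
    print(call, '->', value)
```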

Searching¶

str1.startswith(str2[, startpos[, endpos]]) - Returns True if str1 starts with str2
str1.endswith(str2[, startpos[, endpos]]) - Returns True if str1 ends with str2
str1.find(str2[, startpos[, endpos]]) - Returns the lowest index of str1 at which str2 is found, or -1 if it is not found
str1.index(str2[, startpos[, endpos]]) - Returns the lowest index of str1 at which str2 is found; raises ValueError if it is not found
str1.count(str2[, startpos[, endpos]]) - Returns the number of occurrences of str2 in str1
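
The sketch below exercises these methods on one sample sequence and shows how find and index differ when the substring is absent.

```python
seq = 'TCAGTTAT'

print(seq.startswith('TCA'))    # True
print(seq.startswith('CA', 1))  # True: the test begins at index 1
print(seq.endswith('AT'))       # True
print(seq.find('TT'))           # 4
print(seq.find('T', 1))         # 4: the search begins at index 1
print(seq.find('GG'))           # -1
print(seq.count('T'))           # 4

# index behaves like find, except that a missing substring is an error.
try:
    seq.index('GG')
    index_raised = False
except ValueError:
    index_raised = True
print(index_raised)             # True
```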

Replacing¶

str1.replace(oldstr, newstr[, count]) - Returns a copy of str1 with all occurrences of the substring oldstr replaced by the string newstr; if count is specified, only the first count occurrences are replaced.
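
For example, replace can turn a DNA string into its RNA spelling. This is an illustrative sketch; note that the original string is left unchanged.

```python
dna = 'TCAGTTAT'

rna = dna.replace('T', 'U')            # every T is replaced
partial = dna.replace('T', 'U', 2)     # only the first two T's are replaced

print(rna)      # UCAGUUAU
print(partial)  # UCAGUTAT
print(dna)      # TCAGTTAT
```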

Changing case¶

str1.lower() - Returns a copy of the string with all of its characters converted to lowercase
str1.upper() - Returns a copy of the string with all of its characters converted to uppercase
str1.capitalize() - Returns a copy of the string with its first character capitalized and the rest converted to lowercase; the first character is left as is if it is not a letter (e.g., if it is a space)
str1.title() - Returns a copy of the string with each word beginning with an uppercase character and the rest lowercase
str1.swapcase() - Returns a copy of the string with lowercase characters made uppercase and vice versa
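
A quick comparison of the case-changing methods on illustrative sample strings:

```python
name = 'tata box'

print(name.upper())            # TATA BOX
print('TATA BOX'.lower())      # tata box
print(name.capitalize())       # Tata box
print(name.title())            # Tata Box
print('TaTa'.swapcase())       # tAtA

# With a leading space the first character is not a letter,
# so nothing is capitalized.
print(repr(' tata'.capitalize()))   # ' tata'
```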

Reformatting¶

str1.lstrip([chars]) - Returns a copy of str1 with leading characters removed (those that appear in chars, or whitespace if chars is omitted).
str1.rstrip([chars]) - Returns a copy of str1 with trailing characters removed (those that appear in chars, or whitespace if chars is omitted).
str1.strip([chars]) - Returns a copy of str1 with leading and trailing characters removed (those that appear in chars, or whitespace if chars is omitted).
str1.ljust(width[, fillchar]) - Returns str1 left-justified in a new string of length width, "padded" with fillchar (the default fill character is a space).
str1.rjust(width[, fillchar]) - Returns str1 right-justified in a new string of length width, "padded" with fillchar (the default fill character is a space).
str1.center(width[, fillchar]) - Returns str1 centered in a new string of length width, "padded" with fillchar (the default fill character is a space).
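
The following sketch shows stripping and padding on illustrative sample strings. The chars argument is treated as a set of characters to strip, not as a prefix or suffix.

```python
stripped = '  TCAG \n'.strip()          # whitespace is removed by default
left_only = 'xxTCAGxx'.lstrip('x')
right_only = 'xxTCAGxx'.rstrip('x')
both_ends = 'AATCAGAA'.strip('A')       # only the A's at the ends are removed

print(repr(stripped))     # 'TCAG'
print(left_only)          # TCAGxx
print(right_only)         # xxTCAG
print(both_ends)          # TCAG

print('TCAG'.ljust(8, '-'))     # TCAG----
print(repr('TCAG'.rjust(8)))    # '    TCAG'
print('TCAG'.center(8, '*'))    # **TCAG**
```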

format method¶

format(value[, format-specification]) - Returns a string obtained by formatting value according to the format-specification; if no format-specification is provided, this is equivalent to str(value).
format-specification.format(posargs, ..., kwdargs, ...) - Returns a string formatted according to the format-specification; any number of positional arguments may be followed by any number of keyword arguments

str1 = 'a string'

'"{0}" contains {1} characters'.format(str1, len(str1))

'"a string" contains 8 characters'

'"{}" contains {} characters'.format(str1, len(str1))

'"a string" contains 8 characters'

'"{string}" contains {length} characters'.format(string=str1, length=len(str1))

'"a string" contains 8 characters'

'"{string}" contains {length} characters'.format(length=len(str1),string=str1)

'"a string" contains 8 characters'
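
The examples above only substitute values. A format specification can also control width, alignment, and number presentation. The sketch below uses a handful of common specifications and is far from a full tour of the mini-language.

```python
fixed = format(3.14159, '.2f')        # two digits after the decimal point
padded = format(42, '5d')             # integer right-aligned in a field of width 5
hexadecimal = format(255, 'x')        # lowercase hexadecimal
plain = format(42)                    # no specification: same as str(42)

print(fixed)          # 3.14
print(repr(padded))   # '   42'
print(hexadecimal)    # ff
print(plain)          # 42

# Inside a replacement field, the specification follows a colon.
aligned = '{:>8}|{:<8}|'.format('TCAG', 'UCAG')
percent = 'GC content: {:.1%}'.format(0.5)
print(aligned)        #     TCAG|UCAG    |
print(percent)        # GC content: 50.0%
```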

Ranges¶

A range represents a series of integers.

range(stop) - creates a range representing the integers from 0 up to but not including stop.
range(start, stop) - creates a range representing the integers from start up to but not including stop.
range(start, stop, step) - creates a range representing the integers from start up to but not including stop, in increments of step.

range(5)

range(0, 5)

set(range(5))

{0, 1, 2, 3, 4}

set(range(5, 10))

{5, 6, 7, 8, 9}

set(range(5, 10, 2))

{5, 7, 9}

set(range(15, 10, -2))

{11, 13, 15}

set(range(0, -25, -5))

{-20, -15, -10, -5, 0}
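
The examples above convert each range to a set to display its contents, which hides the order in which the integers are produced. Converting to a list keeps that order, and a range can also be indexed and tested for membership directly, as in this sketch.

```python
r = range(5, 10, 2)

print(list(r))                     # [5, 7, 9]
print(len(r))                      # 3
print(r[1], r[-1])                 # 7 9
print(7 in r, 8 in r)              # True False

# A negative step counts downward; the list shows the actual order.
descending = list(range(15, 10, -2))
print(descending)                  # [15, 13, 11]
```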

Tuples¶

A tuple is an immutable sequence that can contain any type of element.
Tuples are written as comma-separated series of items surrounded by parentheses.
One-element tuples must be written with a comma after their single element.
An empty pair of parentheses creates an empty tuple.
Given a sequence as an argument, the tuple function creates a tuple containing the elements of the sequence.

('TCAG', 'UCAG') # 2 element tuple

('TCAG', 'UCAG')

('TCAG',) # 1 element tuple

('TCAG',)

() # empty tuple

()

('TCAG') # Not a tuple!

'TCAG'

tuple('TCAG')

('T', 'C', 'A', 'G')

tuple(range(5,10))

(5, 6, 7, 8, 9)
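
Immutability means a tuple can be read by index or slice but not modified. A "changed" tuple is really a new tuple built from the old one, as this sketch with illustrative names shows.

```python
bases = ('T', 'C', 'A', 'G')

print(bases[0])      # T
print(bases[1:3])    # ('C', 'A')

# Item assignment is not allowed on a tuple.
try:
    bases[0] = 'U'
    assignment_worked = True
except TypeError:
    assignment_worked = False
print(assignment_worked)   # False

# Build a new tuple instead; note the comma in the one-element tuple.
rna_bases = ('U',) + bases[1:]
print(rna_bases)     # ('U', 'C', 'A', 'G')
```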

Tuple packing & unpacking¶

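The right-hand side of an assignment statement can be a series of comma-separated expressions. The result is a tuple with those values. This is called tuple packing. Conversely, when the left-hand side is a comma-separated series of names, the tuple on the right is unpacked: each name is bound to the corresponding element.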

bases = 'TCAG', 'UCAG'

bases

('TCAG', 'UCAG')

DNABases, RNABases = 'TCAG', 'UCAG'

DNABases

'TCAG'

RNABases

'UCAG'

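def recognition_site(base_seq, recognition_seq):
    """Return the index of the first appearance of recognition_seq in base_seq,
    or -1 if it does not appear"""
    return base_seq.find(recognition_seq)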

def restriction_cut(base_seq, recognition_seq, offset = 0):
    """Return a pair of sequences derived from base_seq by splitting it at the first appearance
    of recognition_seq; offset, which may be negative, is the number of bases relative to the
    beginning of the site where the sequence is cut"""
    site = recognition_site(base_seq, recognition_seq)
    return base_seq[:site+offset], base_seq[site+offset:]

aseq1 = 'AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'

left, right = restriction_cut(aseq1, 'TCCGGA')

left

'AAAAATCCCGAGGCGGCTATATAGGGC'

right

'TCCGGAGGCGTAATATAAAA'
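
The sketch below repeats the two functions so that it runs on its own, then shows a nonzero offset and what happens when the recognition sequence is absent. restriction_cut_checked is a hypothetical variant, and returning the whole sequence with an empty string is only one possible policy for the not-found case.

```python
def recognition_site(base_seq, recognition_seq):
    return base_seq.find(recognition_seq)

def restriction_cut(base_seq, recognition_seq, offset=0):
    site = recognition_site(base_seq, recognition_seq)
    return base_seq[:site+offset], base_seq[site+offset:]

aseq1 = 'AAAAATCCCGAGGCGGCTATATAGGGCTCCGGAGGCGTAATATAAAA'

# An offset of 1 moves the cut one base into the recognition site.
left, right = restriction_cut(aseq1, 'TCCGGA', 1)
print(left[-4:], right[:5])        # GGCT CCGGA

# When the site is absent, find returns -1, so the "cut" silently
# lands one base before the end of the sequence.
missing_left, missing_right = restriction_cut(aseq1, 'GGGGGG')
print(missing_right)               # A

def restriction_cut_checked(base_seq, recognition_seq, offset=0):
    """Hypothetical variant: leave the sequence uncut if the site is absent."""
    site = recognition_site(base_seq, recognition_seq)
    if site == -1:
        return base_seq, ''
    return base_seq[:site+offset], base_seq[site+offset:]

print(restriction_cut_checked(aseq1, 'GGGGGG')[1] == '')   # True
```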

a, b = 4, 2

a

4

b

2

a, b = b, a

a

2

b

4

Lists¶

A list is a mutable sequence of any kind of element.
The syntax for lists is a comma-separated series of values enclosed in square brackets.
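
A minimal sketch of creating a list and changing it in place; the values are illustrative.

```python
bases = ['T', 'C', 'A', 'G']

bases[0] = 'U'        # item assignment replaces an element in place
bases.append('N')     # append adds one element at the end
print(bases)          # ['U', 'C', 'A', 'G', 'N']

# Elements may be of any type, including other collections.
mixed = [1, 'TCAG', (2, 3), [4, 5]]
print(len(mixed))     # 4
```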

Deleting¶

del lst[n] - removes the element at index n from lst
del lst[i:j] - removes the elements at indexes i up to but not including j from lst
del lst[i:j:k] - removes every kth element of lst, starting at index i and stopping before index j
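
The three forms of del applied in turn to one list; each statement modifies the list in place.

```python
lst = list(range(10))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

del lst[0]              # remove the element at index 0
after_index = list(lst)
print(after_index)      # [1, 2, 3, 4, 5, 6, 7, 8, 9]

del lst[1:3]            # remove the elements at indexes 1 and 2
after_slice = list(lst)
print(after_slice)      # [1, 4, 5, 6, 7, 8, 9]

del lst[::2]            # remove every second element, starting at index 0
print(lst)              # [4, 6, 8]
```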

list1 = [1,2,3]

list2 = [4,5]

list1 + list2 # Concatenation, list1 & list2 remain unchanged

[1, 2, 3, 4, 5]

list1.extend(list2)

list1

[1, 2, 3, 4, 5]

list2

[4, 5]

Sequence-oriented string methods¶

string.splitlines([keepflg]) - Returns a list of the “lines” in string, splitting at end-of-line characters. If keepflg is omitted or is false, the end-of-line characters are not included in the lines; otherwise, they are.
string.split([sepr[, maxwords]]) - Returns a list of the “words” in string, using sepr as a word delimiter. In the special case where sepr is omitted or is None, words are delimited by runs of consecutive whitespace characters; if maxwords is specified, at most maxwords splits are made, so the result has at most maxwords+1 elements.
string.rsplit([sepr[, maxwords]]) - Performs a reverse split: same as split except that splitting starts from the right, so if maxwords is specified and is less than the number of words in string, the result contains the unsplit remainder followed by the last maxwords words.
sepr.join(seq) - Returns a string formed by concatenating the strings in seq separated by sepr, which can be any string (including the empty string).
string.partition(sepr) - Returns a tuple with three elements: the portion of string up to the first occurrence of sepr, sepr itself, and the portion of string after the first occurrence of sepr. If sepr is not found in string, the tuple is (string, '', '').
string.rpartition(sepr) - Returns a tuple with three elements: the portion of string up to the last occurrence of sepr, sepr itself, and the portion of string after the last occurrence of sepr. If sepr is not found in string, the tuple is ('', '', string).

a = [ 'a', 'b', 'c', 'd', 'e']

abc = ','.join(a)

abc

'a,b,c,d,e'

abc.split()

['a,b,c,d,e']

abc.split(',')

['a', 'b', 'c', 'd', 'e']
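
The remaining methods in this group can be sketched the same way. The sample strings are illustrative.

```python
text = 'line one\nline two\n'
print(text.splitlines())         # ['line one', 'line two']
print(text.splitlines(True))     # ['line one\n', 'line two\n']

words = 'a b c d'
print(words.split(None, 2))      # ['a', 'b', 'c d']
print(words.rsplit(None, 2))     # ['a b', 'c', 'd']

setting = 'key=value=x'
print(setting.partition('='))    # ('key', '=', 'value=x')
print(setting.rpartition('='))   # ('key=value', '=', 'x')

# When the separator is absent, the whole string ends up at one end.
print('novalue'.partition('='))  # ('novalue', '', '')
print('novalue'.rpartition('=')) # ('', '', 'novalue')
```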